
    Machine Learning Techniques for Evolving Threats


    TESSERACT: Eliminating Experimental Bias in Malware Classification across Space and Time

    Is Android malware classification a solved problem? Published F1 scores of up to 0.99 appear to leave very little room for improvement. In this paper, we argue that results are commonly inflated due to two pervasive sources of experimental bias: "spatial bias" caused by distributions of training and testing data that are not representative of a real-world deployment; and "temporal bias" caused by incorrect time splits of training and testing sets, leading to impossible configurations. We propose a set of space and time constraints for experiment design that eliminates both sources of bias. We introduce a new metric that summarizes the expected robustness of a classifier in a real-world setting, and we present an algorithm to tune its performance. Finally, we demonstrate how this allows us to evaluate mitigation strategies for time decay such as active learning. We have implemented our solutions in TESSERACT, an open source evaluation framework for comparing malware classifiers in a realistic setting. We used TESSERACT to evaluate three Android malware classifiers from the literature on a dataset of 129K applications spanning over three years. Our evaluation confirms that earlier published results are biased, while also revealing counter-intuitive performance and showing that appropriate tuning can lead to significant improvements. Comment: This arXiv version (v4) corresponds to the one published at USENIX Security Symposium 2019, with a fixed typo in Equation (4), which reported an extra normalization factor of (1/N). The results in the paper and the released implementation of the TESSERACT framework remain valid and correct as they rely on Python's numpy implementation of area under the curve.
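    The comment above refers to the paper's time-aware robustness metric (AUT), which summarizes a classifier's performance over successive test periods. As a rough illustration only, assuming the metric is the trapezoidal area under a per-period performance curve normalized by the time span (the function name and the use of monthly F1 scores below are assumptions, not TESSERACT's released code):

    ```python
    import numpy as np

    def aut(per_period_scores):
        """Area-under-time sketch: trapezoidal area under a per-period
        performance curve (e.g., monthly F1 scores in chronological
        order), normalized so a constant score s yields exactly s."""
        f = np.asarray(per_period_scores, dtype=float)
        n = len(f)
        if n < 2:
            raise ValueError("need at least two test periods")
        # Trapezoidal rule over unit-spaced periods, divided by the
        # time span (n - 1); no extra 1/N factor, matching the typo
        # fix mentioned in the arXiv comment above.
        return float(np.sum((f[1:] + f[:-1]) / 2.0)) / (n - 1)

    # Example: a classifier whose F1 decays over six monthly test windows.
    monthly_f1 = [0.95, 0.90, 0.82, 0.70, 0.61, 0.55]
    print(f"AUT(F1, 6 months) = {aut(monthly_f1):.3f}")
    ```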

    Transcend: Detecting Concept Drift in Malware Classification Models

    Building machine learning models of malware behavior is widely accepted as a panacea towards effective malware classification. A crucial requirement for building sustainable learning models, though, is to train on a wide variety of malware samples. Unfortunately, malware evolves rapidly and it thus becomes hard—if not impossible—to generalize learning models to reflect future, previously-unseen behaviors. Consequently, most malware classifiers become unsustainable in the long run, becoming rapidly antiquated as malware continues to evolve. In this work, we propose Transcend, a framework to identify aging classification models in vivo during deployment, much before the machine learning model’s performance starts to degrade. This is a significant departure from conventional approaches that retrain aging models retrospectively when poor performance is observed. Our approach uses a statistical comparison of samples seen during deployment with those used to train the model, thereby building metrics for prediction quality. We show how Transcend can be used to identify concept drift based on two separate case studies on Android and Windows malware, raising a red flag before the model starts making consistently poor decisions due to out-of-date training.
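    As a rough sketch of the idea described above (statistically comparing deployment-time samples against training-time samples), the following computes a conformal-style p-value from nonconformity scores and flags samples that look unlike anything seen at training time. The distance-to-centroid score, the threshold, and the function names are illustrative assumptions, not Transcend's actual implementation:

    ```python
    import numpy as np

    def p_value(calibration_scores, test_score):
        """Conformal p-value: fraction of calibration nonconformity
        scores at least as 'strange' (>=) as the test sample's score."""
        cal = np.asarray(calibration_scores, dtype=float)
        return float(np.sum(cal >= test_score) + 1) / (len(cal) + 1)

    def flag_drift(calibration_scores, test_score, threshold=0.1):
        """Raise a red flag when the test sample has low credibility
        with respect to the training distribution."""
        return p_value(calibration_scores, test_score) < threshold

    # Example with a distance-to-class-centroid nonconformity measure
    # (one simple choice; the framework is agnostic to the score used).
    rng = np.random.default_rng(0)
    train = rng.normal(loc=0.0, scale=1.0, size=(500, 8))    # training features
    centroid = train.mean(axis=0)
    cal_scores = np.linalg.norm(train - centroid, axis=1)    # calibration scores

    new_sample = rng.normal(loc=3.0, scale=1.0, size=8)      # drifted sample
    score = np.linalg.norm(new_sample - centroid)
    print(p_value(cal_scores, score), flag_drift(cal_scores, score))
    ```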

    Prescience: Probabilistic Guidance on the Retraining Conundrum for Malware Detection

    Malware evolves perpetually and relies on increasingly sophisticated attacks to supersede defense strategies. Data-driven approaches to malware detection run the risk of becoming rapidly antiquated. Keeping pace with malware requires models that are periodically enriched with fresh knowledge, commonly known as retraining. In this work, we propose the use of Venn-Abers predictors for assessing the quality of binary classification tasks as a first step towards identifying antiquated models. One of the key benefits behind the use of Venn-Abers predictors is that they are automatically well calibrated and offer probabilistic guidance on the identification of nonstationary populations of malware. Our framework is agnostic to the underlying classification algorithm and can then be used for building better retraining strategies in the presence of concept drift. Results obtained over a timeline-based evaluation with about 90K samples show that our framework can identify when models tend to become obsolete.
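    A Venn-Abers predictor calibrates a classifier's raw scores by fitting isotonic regression twice per test object, once under each assumed label, yielding a probability interval [p0, p1]. Below is a minimal sketch using scikit-learn's isotonic regression; the calibration data, the function name, and the single-probability merge p1 / (1 - p0 + p1) follow the general Venn-Abers literature and are illustrative assumptions, not Prescience's released code:

    ```python
    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    def venn_abers(cal_scores, cal_labels, test_score):
        """Return (p0, p1), the multiprobability prediction for the
        positive class from an inductive Venn-Abers predictor."""
        probs = []
        for assumed_label in (0, 1):
            # Append the test score with the assumed label, refit the
            # isotonic calibrator, and read off its value at the test score.
            scores = np.append(cal_scores, test_score)
            labels = np.append(cal_labels, assumed_label)
            iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
            iso.fit(scores, labels)
            probs.append(float(iso.predict([test_score])[0]))
        return tuple(probs)  # (p0, p1)

    # Example: calibration scores from any underlying classifier.
    cal_scores = np.array([0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])
    cal_labels = np.array([0,   0,   0,    1,   0,   1,   1,   1])
    p0, p1 = venn_abers(cal_scores, cal_labels, test_score=0.65)
    p = p1 / (1.0 - p0 + p1)   # common single-probability merge
    print(f"interval=[{p0:.2f}, {p1:.2f}]  merged p={p:.2f}")
    # A wide [p0, p1] interval signals poorly calibrated, possibly
    # nonstationary regions, which can guide retraining decisions.
    ```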